To address the issue that traditional Sequential Pattern Mining (SPM) does not consider pattern repetition and ignores the effects of utility (unit price or profit) and pattern length on user interest, a Top-k One-off high average Utility sequential Pattern mining (TOUP) algorithm was proposed. The TOUP algorithm includes two core steps: average utility calculation and candidate pattern generation. Firstly, a CSP (Calculation Support of Pattern) algorithm based on the occurrence position of each item and the item repetition relation array was proposed to calculate pattern support, thereby achieving rapid calculation of the average utility of patterns. Secondly, candidate patterns were generated by itemset extension and sequence extension, and a maximum average utility upper bound was proposed, on the basis of which candidate patterns were pruned effectively. Experimental results on five real datasets and one synthetic dataset show that, compared with the TOUP-dfs and HAOP-ms algorithms, the TOUP algorithm reduces the number of candidate patterns by 38.5% to 99.8% and 0.9% to 77.6%, respectively, and decreases the running time by 33.6% to 97.1% and 57.9% to 97.2%, respectively. Therefore, TOUP performs better and can mine patterns of interest to users more efficiently.
Accurate prediction of taxi demands between urban regions can provide decision support for taxi guidance and scheduling as well as passenger travel recommendation, so as to optimize the relation between taxi supply and demand. However, most existing models only model and predict the taxi demands within a region, do not sufficiently consider the spatial-temporal correlation between regions, and pay little attention to the more fine-grained demand prediction between regions. To solve these problems, a prediction model for taxi demands between urban regions, the Origin-Destination fusion with Spatial-Temporal Network (ODSTN) model, was proposed. In this model, complex spatial-temporal correlations between regions were captured from two spatial dimensions (regions and region pairs) and three temporal dimensions (recent, daily and weekly periods) by using graph convolution and an attention mechanism, and a new path perception fusion mechanism was designed to combine the multi-angle features and finally realize the taxi demand prediction between urban regions. Experiments were carried out on two real taxi order datasets from Chengdu and Manhattan. The results show that the Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE) of the ODSTN model are 0.8971, 3.5274 and 50.6556% on the Chengdu dataset and 0.5896, 1.1638 and 61.0794% on the Manhattan dataset, respectively, indicating that the ODSTN model has high accuracy in taxi demand prediction tasks.
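The three error measures reported above can be sketched as follows. This is a minimal illustration using the textbook definitions, not the authors' evaluation code. The function names are hypothetical, and skipping zero-demand entries in MAPE is an assumption, since the abstract does not say how zero demands are handled.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent.

    Assumption: entries whose true demand is zero are skipped,
    because the percentage error is undefined for them.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(abs((t - p) / t) for t, p in pairs) / len(pairs)

# Toy origin-destination demands (made-up numbers, not from the paper).
demand_true = [2, 4, 0, 5]
demand_pred = [1, 4, 1, 7]
print(mae(demand_true, demand_pred))
print(rmse(demand_true, demand_pred))
print(mape(demand_true, demand_pred))
```

Origin-destination demand matrices are typically sparse, so MAPE is sensitive to how zero and near-zero demands are treated; this may be one reason a percentage error can look large even when MAE and RMSE are small.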
In Hematoxylin-Eosin (HE)-stained pathological images, the uneven distribution of cell staining and the diversity of various tissue morphologies bring great challenges to automated segmentation. Traditional convolutions cannot capture the correlation features between pixels in a large neighborhood, making it difficult to further improve the segmentation performance. Therefore, a Multi-Channel Segmentation Network with gated axial self-attention (MCSegNet) model was proposed to achieve accurate segmentation of nuclei in pathological images. In the proposed model, a dual-encoder and decoder structure was adopted, in which the axial self-attention encoding channel was used to capture global features, while the convolutional encoding channel based on residual structure was used to obtain local fine features. The feature representation was enhanced by feature fusion at the end of the encoding channel, providing a good information base for the decoder. And in the decoder, segmentation results were gradually generated by cascading multiple upsampling modules. In addition, the improved hybrid loss function was used to alleviate the common problem of sample imbalance in pathological images effectively. Experimental results on MoNuSeg2020 public dataset show that the improved segmentation method is 2.66 percentage points and 2.77 percentage points higher than U-Net in terms of F1-score and Intersection over Union (IoU) indicators, respectively, and effectively improves the pathological image segmentation effect and the reliability of clinical diagnosis.
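As a reference for the two indicators reported above, the following minimal sketch computes F1-score and IoU for a binary nucleus mask. It assumes pixel-level counting on masks given as nested lists of 0/1 values; the abstract does not state whether the reported F1-score is pixel-level or object-level, and the function name is hypothetical.

```python
def f1_and_iou(pred_mask, true_mask):
    """Pixel-level F1-score and IoU for two binary masks of equal shape."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1      # nucleus pixel correctly predicted
            elif p and not t:
                fp += 1      # background predicted as nucleus
            elif t and not p:
                fn += 1      # nucleus pixel missed
    f1_denominator = 2 * tp + fp + fn
    iou_denominator = tp + fp + fn
    # Convention (assumption): two empty masks count as a perfect match.
    f1 = 2 * tp / f1_denominator if f1_denominator else 1.0
    iou = tp / iou_denominator if iou_denominator else 1.0
    return f1, iou

# Toy 2 x 3 masks (made-up values).
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
print(f1_and_iou(pred, truth))
```

Both measures ignore true negatives, which is why they are commonly preferred over plain accuracy when background pixels dominate, as in the sample-imbalance setting the abstract mentions.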
Process mining can build process models from the event logs generated by enterprise information management systems. When the actual business process changes, deviations arise between the process model and the event logs, and the process model needs to be repaired. For process models with concurrent structures, the precision of some existing repair methods is reduced because they add self-loops and invisible transitions. Therefore, a method for repairing process models with concurrent structures was proposed on the basis of logic Petri nets and token replay. Firstly, the insertion position of the sub-model was determined according to the relationship between the input-output places of the sub-model and the event logs. Then, the deviation positions were determined by a token replay method. Finally, a method was designed to repair the process models based on logic Petri nets. The correctness and effectiveness of this method were verified by simulations on the ProM platform, and the proposed method was compared with Fahland’s and other methods. The results show that the precision of this method is about 85%, which is 17 and 11 percentage points higher than those of Fahland’s and Goldratt methods, respectively. In terms of simplicity, the proposed method does not add any self-loop or invisible transition, while Fahland’s and Goldratt methods add some self-loops and invisible transitions. The fitness values of all three methods are above 0.9, and that of the Goldratt method is slightly lower. These results verify that the model repaired by the proposed method has higher fitness and precision.
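To make the token replay step concrete, the sketch below replays one trace on an ordinary Petri net and computes the commonly used token-based fitness, 0.5(1 - m/c) + 0.5(1 - r/p), where m, c, r and p are the numbers of missing, consumed, remaining and produced tokens. This is a generic illustration, not the logic-Petri-net method of the abstract. The net encoding, the place names and the function name are all hypothetical.

```python
from collections import Counter

def replay_fitness(net, trace, source="start", sink="end"):
    """Token-based replay fitness of one trace.

    net maps each activity to (input_places, output_places).
    Assumption: every activity in the trace exists in the net.
    """
    marking = Counter({source: 1})
    produced, consumed, missing = 1, 0, 0   # the initial token counts as produced
    for activity in trace:
        inputs, outputs = net[activity]
        for place in inputs:
            if marking[place] == 0:
                missing += 1                # deviation: add an artificial token
                marking[place] += 1
            marking[place] -= 1
            consumed += 1
        for place in outputs:
            marking[place] += 1
            produced += 1
    # The environment consumes the final token from the sink place.
    if marking[sink] == 0:
        missing += 1
        marking[sink] += 1
    marking[sink] -= 1
    consumed += 1
    remaining = sum(marking.values())
    return 0.5 * (1 - missing / consumed) + 0.5 * (1 - remaining / produced)

# A small net with a concurrent block: after a, activities b and c run in parallel.
net = {
    "a": (["start"], ["p1", "p2"]),
    "b": (["p1"], ["p3"]),
    "c": (["p2"], ["p4"]),
    "d": (["p3", "p4"], ["end"]),
}
print(replay_fitness(net, ["a", "b", "c", "d"]))  # trace fits the model
print(replay_fitness(net, ["a", "b", "d"]))       # c is skipped in the log
```

In the second trace the missing token on `p4` and the remaining token on `p2` point at the concurrent branch containing `c`, which is the kind of deviation position a repair method would then act on.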
Traditional Deep Learning (DL)-based multi-objective solvers suffer from low model utilization and tend to fall into local optima. To address these problems, a Multi-objective Optimization model for Unmanned aerial vehicles Trajectory based on Decomposition and Trajectory search (DTMO-UT) was proposed. The proposed model consists of an encoding part and a decoding part. First, the encoding part contains a Device encoder (Dencoder) and a Weight encoder (Wencoder), which extract the state information of the Internet of Things (IoT) devices and the features of the weight vectors, respectively. The weight vectors represent the scalar optimization sub-problems into which the Multi-objective Optimization Problem (MOP) is decomposed, so the MOP can be solved by solving all the sub-problems. Because the Wencoder can encode all sub-problems, the utilization of the model is improved. Then, the decoding part, which contains the Trajectory decoder (Tdecoder), decodes the encoded features to generate the Pareto optimal solutions. Finally, to keep the greedy strategy from falling into local optima, a trajectory search technique was added to the Tdecoder: multiple candidate trajectories are generated and the one with the best scalar value is selected as the Pareto optimal solution. In this way, the exploration ability of the Tdecoder is enhanced during trajectory planning, and a better-quality Pareto set is found. Simulation results show that, compared with the mainstream DL MOP solvers, the proposed model has 98.93% fewer model parameters while reducing the distribution of MOP solutions by 0.076%, improving the ductility of the solutions by 0.014% and increasing the overall performance by 1.23%, which shows the strong practical trajectory planning ability of the DTMO-UT model.
Aiming at the problem that the existing contrast sequential pattern mining methods mainly focus on character sequence datasets and are difficult to be applied to time series datasets, a new Contrast Order-preserving Pattern Mining (COPM) algorithm was proposed. Firstly, in the candidate pattern generation stage, a pattern fusion strategy was used to reduce the number of candidate patterns. Then, in the pattern support calculation stage, the support of super-pattern was calculated by using the matching results of sub-patterns. Finally, a dynamic pruning strategy of minimum support threshold was designed to further effectively prune the candidate patterns. Experimental results show that on six real time series datasets, the memory consumption of COPM algorithm is at least 52.1% lower than that of COPM-o (COPM-original) algorithm, 36.8% lower than that of COPM-e (COPM-enumeration) algorithm, and 63.6% lower than that of COPM-p (COPM-prune) algorithm. At the same time, the running time of COPM algorithm is at least 30.3% lower than that of COPM-o algorithm, 8.8% lower than that of COPM-e algorithm and 41.2% lower than that of COPM-p algorithm. Therefore, in terms of algorithm performance, COPM algorithm is superior to COPM-o, COPM-e and COPM-p algorithms. The experimental results verify that COPM algorithm can effectively mine the contrast order-preserving patterns to find the differences between different classes of time series datasets.
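To illustrate what an order-preserving pattern is, the sketch below represents a pattern as the rank vector of a window of a time series and counts its support with a naive scan. It is only a baseline illustration: it uses neither the pattern fusion strategy nor the reuse of sub-pattern matching results described above. It assumes the values inside each window are distinct, and the function names are hypothetical.

```python
def relative_order(window):
    """Rank of each value in the window (1 = smallest).

    Assumption: values within a window are distinct.
    """
    order = sorted(range(len(window)), key=lambda i: window[i])
    ranks = [0] * len(window)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return tuple(ranks)

def support(series, pattern):
    """Number of windows of the series whose relative order equals the pattern."""
    m = len(pattern)
    target = tuple(pattern)
    return sum(
        1
        for start in range(len(series) - m + 1)
        if relative_order(series[start:start + m]) == target
    )

# The pattern (1, 3, 2) means "low, high, middle".
series = [1, 4, 2, 5, 3, 6]
print(relative_order([3.1, 5.0, 4.2]))
print(support(series, (1, 3, 2)))
```

A contrast pattern would then be one whose support is high in the time series of one class and low in those of another; the thresholds that decide "high" and "low" are parameters of the mining task and are not given in the abstract.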
Attribute reduction is a hot research topic in rough set theory. Most of the algorithms of attribute reduction for continuous data are based on dominance relations or neighborhood relations. However, continuous datasets do not necessarily have dominance relations in attributes. And the attribute reduction algorithms based on neighborhood relations can adjust the granulation degree through neighborhood radius, but it is difficult to unify the radii due to the different dimensions of attributes and the continuous values of radius parameters, resulting in high computational cost of the whole parameter granulation process. To solve this problem, a multi-granularity attribute reduction strategy based on cluster granulation was proposed. Firstly, the similar samples were classified by the clustering method, and the concepts of approximate set, relative positive region and positive region reduction based on clustering were proposed. Secondly, according to JS (Jensen-Shannon) divergence theory, the difference of data distribution of each attribute among clusters was measured, and representative features were selected to distinguish different clusters. Finally, an attribute reduction algorithm was designed using a discernibility matrix. In the proposed algorithm, the attributes were not required to have ordered relations. Different from neighborhood radius, the clustering parameter was discrete, and the dataset was able to be divided into different granulation degrees by adjusting this parameter. Experimental results on UCI and Kent Ridge datasets show that this attribute reduction algorithm can directly deal with continuous data. At the same time, by using this algorithm, the redundant features in the datasets can be removed while maintaining or even improving the classification accuracy by discrete adjustment of the parameters in a small range.
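The Jensen-Shannon divergence used above to compare the distribution of an attribute across clusters can be sketched as follows. This is the standard definition with base-2 logarithms, so the value lies in [0, 1]. Treating each attribute as a discrete histogram per cluster is an assumption; the abstract does not say how the continuous attribute values are binned.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)

# Histograms of one attribute in two clusters (made-up numbers).
cluster_a = [0.7, 0.2, 0.1]
cluster_b = [0.1, 0.2, 0.7]
print(js_divergence(cluster_a, cluster_b))
print(js_divergence(cluster_a, cluster_a))
```

Unlike the KL divergence, this measure is symmetric and stays finite when one histogram has empty bins, which makes it convenient for ranking attributes by how well they separate clusters.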
To address the problems of current Convolutional Neural Networks (CNNs) that use end-layer features to recognize facial expressions, such as complex model structure, too many parameters and unsatisfactory recognition accuracy, an optimization algorithm combining an improved CNN with a Support Vector Machine (SVM) was proposed. First, the network model was designed with continuous convolution to obtain more nonlinear activations. Then, an adaptive Global Average Pooling (GAP) layer was used to replace the fully connected layer of the traditional CNN to reduce the network parameters. Finally, to improve the generalization ability of the model, an SVM classifier was used instead of the traditional Softmax function to realize expression recognition. Experimental results show that the proposed algorithm achieves 73.4% and 98.06% recognition accuracy on the Fer2013 and CK+ datasets, respectively, which on the Fer2013 dataset is 2.2 percentage points higher than the traditional LeNet-5 algorithm. Moreover, the network model has a simple structure, fewer parameters and good robustness.
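The parameter saving from replacing a fully connected layer with GAP can be illustrated with a small sketch. Feature maps are plain nested lists here, and the layer sizes in the example are hypothetical values chosen for illustration, not those of the proposed network.

```python
def global_average_pool(feature_maps):
    """Reduce C feature maps of size H x W to a vector of C channel means."""
    pooled = []
    for channel in feature_maps:
        values = [value for row in channel for value in row]
        pooled.append(sum(values) / len(values))
    return pooled

def fc_parameter_count(channels, height, width, num_outputs):
    """Weights plus biases of a fully connected layer on the flattened maps."""
    return channels * height * width * num_outputs + num_outputs

# Two 2 x 2 feature maps (made-up activations).
maps = [[[1, 2], [3, 4]],
        [[0, 0], [0, 8]]]
print(global_average_pool(maps))

# Hypothetical sizes: 128 maps of 6 x 6 feeding 7 expression classes.
print(fc_parameter_count(128, 6, 6, 7))
```

The pooling itself has no trainable parameters, so the only weights that remain after it belong to whatever classifier consumes the pooled vector, which in the abstract is the SVM.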
The parity blocks of a Maximum-Distance-Separable (MDS) code are all global parity blocks, so the reconstruction chain grows longer as the storage system expands and the reconstruction performance gradually decreases. To address this problem, a new Non-Maximum-Distance-Separable (Non-MDS) code, the local redundant hybrid code Code-LM(s,c), was proposed. Firstly, two types of local parity blocks, the horizontal parity block in the strip-set and the horizontal-diagonal parity block, were added to every strip-set to shorten the reconstruction chain, and the parity layout of the local redundant hybrid code was designed. Then, four reconstruction formulations for lost data blocks were designed according to the generation rules of the parity blocks and the common blocks shared by the reconstruction chains of different data blocks. Finally, double-disk failures were divided into three situations according to the distance between the strip-sets in which the failed disks are located, and the corresponding reconstruction methods were designed. Theoretical analysis and experimental results show that, at the same storage scale, Code-LM(s,c) reduces the reconstruction time for single-disk and double-disk failures by 84% and 77%, respectively, compared with RDP (Row-Diagonal Parity), and by 67% and 73%, respectively, compared with V2-Code. Therefore, the local redundant hybrid code supports fast recovery from failed disks and improves the reliability of the storage system.
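The effect of a local parity block on the reconstruction chain can be sketched with plain XOR parity. This is a generic single-parity illustration, not the Code-LM(s,c) layout; the block contents and function names are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(surviving_blocks, parity_block):
    """Recover the single lost block of a parity group."""
    return xor_blocks(list(surviving_blocks) + [parity_block])

# One local parity group of three data blocks (made-up contents).
group = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
local_parity = xor_blocks(group)

# Lose the middle block and rebuild it from the two survivors plus the parity.
recovered = reconstruct([group[0], group[2]], local_parity)
print(recovered == group[1])
```

The number of blocks read equals the size of the parity group, so a parity block that covers only one strip-set keeps the chain short no matter how many disks the whole system contains, whereas a global parity block forces the chain to grow with the system.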
To address the ineffective fusion of multi-modal features in RGB-D indoor scene semantic segmentation, a network named APFNet (Attention mechanism and Pyramid Fusion Network) was proposed, for which an attention mechanism fusion module and a pyramid fusion module were designed. To make full use of the complementarity of the RGB features and the Depth features, the attention allocation weights of the two kinds of features were extracted separately by the attention mechanism fusion module, making the network focus more on the multi-modal feature domain with more information content. Local and global information was fused by the pyramid fusion module with pyramid features at four different scales, so that scene context was extracted and the segmentation accuracies of object edges and small-scale objects were improved. By integrating these two fusion modules into a three-branch “encoder-decoder” network, an “end-to-end” output was realized. Comparative experiments were conducted against state-of-the-art methods, such as multi-level RGB-D residual feature Fusion network (RDF-152), Attention Complementary features Network (ACNet) and Spatial information Guided convolution Network (SGNet), on the SUN RGB-D and NYU Depth v2 datasets. Compared with the best-performing method RDF-152, with the number of encoder network layers reduced from 152 to 50, the Pixel Accuracy (PA), Mean Pixel Accuracy (MPA) and Mean Intersection over Union (MIoU) of APFNet were increased by 0.4, 1.1 and 3.2 percentage points, respectively. The semantic segmentation accuracies for small-scale objects such as pillows and photos and for large-scale objects such as boards and ceilings were increased by 0.9 to 3.4 and 12.4 to 18 percentage points, respectively. The results show that the proposed APFNet has advantages in the semantic segmentation of indoor scenes.
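The three segmentation measures compared above can all be derived from a confusion matrix, as in the following sketch. It uses the common definitions and assumes that every class occurs at least once in the ground truth; the function name and the toy matrix are hypothetical.

```python
def segmentation_metrics(confusion):
    """PA, MPA and MIoU from a square confusion matrix.

    confusion[i][j] is the number of pixels of true class i predicted as class j.
    Assumption: each class has at least one ground-truth pixel.
    """
    num_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    diagonal = [confusion[i][i] for i in range(num_classes)]
    row_sums = [sum(row) for row in confusion]                      # true pixels per class
    col_sums = [sum(row[j] for row in confusion) for j in range(num_classes)]

    pixel_accuracy = sum(diagonal) / total
    mean_pixel_accuracy = sum(d / r for d, r in zip(diagonal, row_sums)) / num_classes
    mean_iou = sum(
        d / (r + c - d) for d, r, c in zip(diagonal, row_sums, col_sums)
    ) / num_classes
    return pixel_accuracy, mean_pixel_accuracy, mean_iou

# Toy two-class confusion matrix (made-up counts).
print(segmentation_metrics([[3, 1],
                            [2, 4]]))
```

Because MPA and MIoU average over classes rather than over pixels, gains on small objects such as pillows move them more than they move PA, which is consistent with the larger MIoU improvement reported in the abstract.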
With the massive growth of data, how to store and use data has become a hot issue in academic research and industrial applications. As one way to address this issue, instance selection reduces the difficulty of follow-up work by selecting representative instances from the original data according to established rules. Therefore, a voting instance selection algorithm based on learning to hash was proposed. Firstly, the Principal Component Analysis (PCA) method was used to map high-dimensional data to a low-dimensional space. Secondly, the k-means algorithm was combined with vector quantization and run iteratively, and the hash codes of the cluster centers were used to represent the data. After that, the classified data were randomly sampled in proportion, and the final instances were selected by voting over several independent runs of the algorithm. Compared with the Condensed Nearest Neighbor (CNN) algorithm and the linear-complexity instance selection algorithm for big data named LSH-IS-F (Instance Selection algorithm by Hashing with two passes), the proposed algorithm improves the compression ratio by 19% on average. The idea of the proposed algorithm is simple and easy to implement, and the compression ratio can be controlled by adjusting the parameters. Experimental results on 7 datasets show that, with similar test accuracy, the proposed algorithm has a great advantage over random hashing in terms of compression ratio and running time.
Meta-learning applies machine learning methods (meta-algorithms) to learn the mapping between the features of a problem (meta-features) and the relative performance of algorithms, thereby forming meta-knowledge. How to construct and extract meta-features is an important research topic. Since most meta-features used in existing research are statistical features of data, uncertainty modeling was proposed and the impact of uncertainty on the learning system was studied. Based on inconsistency of data, complexity of boundary, uncertainty of model output, linear separability, degree of attribute overlap and uncertainty of feature space, six kinds of uncertainty meta-features were established for data or models. At the same time, the degree of uncertainty of the learning problem itself was measured from different perspectives, and specific definitions were given. The correlations between these meta-features were analyzed on artificial datasets and real datasets covering a large number of classification problems, and multiple classification algorithms such as K-Nearest Neighbor (KNN) were used to conduct a preliminary analysis of the correlation between the meta-features and test accuracy. Results show that the average degree of correlation is about 0.8, indicating that these meta-features have a significant impact on learning performance.
In order to study the problem about data combinations occurring in real life, different kinds of data information were combined together, leading to a structure called associated-combinatorial structure. Actually, the structure was constituted by a data set, an associated relation and a partition. The aim was to use the structure to set up a method of data combination. To this end, the associated-combinatorial structure was transformed into a granulation structure by granulating the associated relation. In this process, data combinations were completed in accordance with the data classifications. Moreover, because an associated-combinatorial structure or a granulation structure could be represented by the associated matrix, the transformation from a structure to another structure was characterized by algebraic calculations determined by matrix transformations. Therefore, the research not only involved theoretical analysis for the data combination, but also established the data processing method connected with matrix transformations. Accordingly, a computer program with linear complexity was formulated according to the data combinations method. The experimental result proves that the program is accurate and fast.
To solve the problem of Fine Particulate Matter (PM2.5) concentration prediction, a PM2.5 concentration prediction model was proposed. First, a comprehensive meteorological index was introduced to take wind, humidity and temperature into account together. Then, the feature vector was constructed by combining this index with the actual concentrations of SO2, NO2, CO and PM10. Finally, a Least Squares Support Vector Machine (LS-SVM) prediction model was built from the feature vectors and the PM2.5 concentration data. Experimental results on 2013 data from the environmental monitoring centers of city A and city B show that the forecast accuracy improves after the comprehensive meteorological index is introduced, with the error reduced by nearly 30%. The proposed model predicts the PM2.5 concentration more accurately and has high generalization ability. Furthermore, the relationship between PM2.5 concentration and both the hospitalization rate and the number of hospital outpatient visits was analyzed, and a high correlation was found between them.
To meet the demand for high-speed scanning and massive data transmission in low-energy X-ray industrial Computed Tomography (CT), a high-speed data acquisition and transmission system for low-energy X-ray industrial CT was designed. The X-CARD 0.2-256G of DT company was selected as the detector. To meet the needs of high-speed analog-to-digital conversion, a high-speed time division multiplexing circuit was combined with ping-pong operation for the data cache; gigabit Ethernet was implemented with a Field Programmable Gate Array (FPGA) as the master chip, so as to meet the requirements of high-speed transmission of multi-channel data. The experimental results show that the speed of the data acquisition system reaches 1 MHz, the transmission speed reaches 926 Mb/s and the dynamic range is greater than 5000. The system can effectively shorten the scanning time of low-energy X-ray detection and can meet the data transmission requirements of more channels.
Currently, queries in transmission line monitoring systems of the smart grid are mostly global queries over the Wireless Sensor Network (WSN), which cannot satisfy the need for flexible and efficient queries over arbitrary areas. The layout and query characteristics of the network were analyzed in detail, and a query algorithm based on mesh structure for the large-scale smart grid, named MSQuery, was proposed. The algorithm aggregated the data of query nodes within different grids into one or more logical query trees, and built an optimized path for collecting query results through a merging strategy for the logical query trees. Experiments compared MSQuery with RSA, which uses a routing structure for querying, and SkySensor, which uses a cluster structure for querying. The simulation results show that MSQuery can quickly return the query results within the query window, reduce the communication cost and save the energy of sensor nodes.
For the nonlinear modeling of Power Amplifiers (PAs), an improved Radial Basis Function Neural Network (RBFNN) model was proposed. Firstly, time-delayed cross terms and output feedback were added to the input. The parameters (weights and centers) of the proposed model were extracted using the Orthogonal Least Square (OLS) algorithm. Then, the model of a Doherty PA was trained and validated successfully with a 15 MHz three-carrier Wideband Code Division Multiple Access (WCDMA) signal, and the Normalized Mean Square Error (NMSE) reached -45 dB. Finally, an inverse class F power amplifier was used to test the universality of the model. The simulation results show that the model fits the characteristics of the power amplifier more accurately.
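The NMSE figure quoted above is usually computed as the ratio of error energy to measured output energy, expressed in decibels. The sketch below uses this common definition on complex baseband samples; the abstract does not spell out the exact formula, so this is an assumption, and the function name and sample values are hypothetical.

```python
import math

def nmse_db(measured, modeled):
    """Normalized Mean Square Error in dB for complex baseband samples."""
    error_energy = sum(abs(y - y_hat) ** 2 for y, y_hat in zip(measured, modeled))
    signal_energy = sum(abs(y) ** 2 for y in measured)
    return 10.0 * math.log10(error_energy / signal_energy)

# Made-up samples: the model output is 10% low in amplitude on every sample.
measured = [1 + 0j, 0 + 1j, -1 + 0j]
modeled = [0.9 + 0j, 0 + 0.9j, -0.9 + 0j]
print(nmse_db(measured, modeled))
```

Under this definition, an NMSE of -45 dB corresponds to an error energy of about 10^-4.5, or roughly 0.003% of the output signal energy.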